Add exception handler on HTTP/2 parent channel to suppress WARN logs#48890
Conversation
Pull request overview
Adds a Netty channel handler to suppress noisy “exceptionCaught reached tail of pipeline” WARN logs on HTTP/2 parent (TCP) connections in Cosmos’ Reactor Netty transport, while preserving WARN-level signal when exceptions may impact in-flight HTTP/2 streams.
Changes:
- Install an HTTP/2 parent-channel `exceptionCaught` handler from `ReactorNettyClient` when HTTP/2 is enabled.
- Add `Http2ParentChannelExceptionHandler`, which consumes parent-channel exceptions and logs at DEBUG vs. WARN based on active stream count and channel activity.
- Add EmbeddedChannel-based unit tests covering exception consumption behavior, and update the changelog entry.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java | Adds logic to install the new handler onto the HTTP/2 parent channel pipeline. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2ParentChannelExceptionHandler.java | New handler that consumes parent-channel exceptions and logs based on connection state. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the fix in the unreleased section. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/http/Http2ParentChannelExceptionHandlerTest.java | New unit tests validating the handler’s exception consumption behavior. |
| sdk/cosmos/azure-cosmos-tests/pom.xml | Enables surefire tests and includes trailing whitespace changes. |
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
In HTTP/2, reactor-netty multiplexes streams on a shared parent TCP connection. The parent channel pipeline has no `ChannelOperationsHandler` (unlike HTTP/1.1), so TCP-level exceptions like `Connection reset by peer` (ECONNRESET) propagate to Netty's TailContext, which logs them as WARN. This adds `Http2ParentChannelExceptionHandler` to the parent channel via `doOnConnected` (accessing `channel.parent()`). The handler consumes exceptions at DEBUG level WITHOUT closing the channel or altering the connection lifecycle, matching HTTP/1.1 logging behavior.

Changes:
- Handler logs `cause.toString()` (not `getMessage()`) for null-safe diagnostics
- Defensive try-catch for duplicate handler name on concurrent stream creation
- Before/after verified with EmbeddedChannel unit tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…toString(), update changelog

Force-pushed from d68fa5c to 2a3b5b2.
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
@sdkReviewAgent |
Address Bhaskar's review: add two tests covering the else branch where `activeStreams > 0` on an active channel, exercising the WARN log path.
- `withHandler_activeStreams_consumedAtWarn`: creates an active H2 stream via `codec.connection().local().createStream()`, fires an exception, and verifies it is consumed (does not reach TailContext).
- `withHandler_activeStreams_channelNotClosed`: same setup, verifies the handler does not close the channel even with active streams.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
✅ Review complete (32:05) No new comments — existing review coverage is sufficient. Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
When Http2FrameCodec is absent from the pipeline, getActiveStreamCount() returns -1. Since -1 != 0 and channelActive == true, the handler takes the safe WARN path. This test covers that fallback behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
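A later revision in this PR replaced the `-1` sentinel with a nullable `Integer`; the resulting DEBUG/WARN decision can be sketched as a small pure function. This is an illustrative model only — the class and method names are made up, not the SDK's API:

```java
// Sketch of the connection-state-based log-level decision discussed in this
// PR. Illustrative names; the real logic lives in
// Http2ParentChannelExceptionHandler (azure-cosmos).
public final class Http2LogLevelSketch {
    /**
     * @param activeStreams active HTTP/2 stream count, or null when the
     *                      Http2FrameCodec could not be found (unknown state)
     * @param channelActive whether the parent channel is still active
     * @return "DEBUG" when the exception cannot affect in-flight streams,
     *         "WARN" otherwise (including the unknown-state fallback)
     */
    public static String logLevel(Integer activeStreams, boolean channelActive) {
        if (activeStreams == null) {
            return "WARN"; // unknown stream count -> take the safe WARN path
        }
        // Either condition alone suffices for DEBUG (OR, not AND).
        if (activeStreams == 0 || !channelActive) {
            return "DEBUG";
        }
        return "WARN";
    }
}
```

With the codec absent (`null` stream count), the sketch falls through to WARN, matching the fallback behavior the test above covers.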
/azp run java - cosmos - tests

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
…ug log in catch
- Change `getActiveStreamCount()` to return `Integer` (nullable) instead of `int` with a `-1` sentinel. `null` explicitly means "could not determine" and takes the safe WARN path. (Addresses Fabian's review)
- Add `logger.debug` in the catch block so codec retrieval failures are observable instead of silently swallowed.
- Add an `Error` guard in `exceptionCaught`: `Error` types (OOM, SOF) propagate to TailContext instead of being consumed. (Addresses Xinlian's review)
- Add `withHandler_errorNotConsumed_propagatesToTail` test.
- Update Javadoc to reflect Exception-only consumption and Error passthrough.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
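The `Error` guard described in this commit amounts to a single type check. A minimal sketch, with a made-up `consumes` helper standing in for the real `exceptionCaught` logic:

```java
// Illustrative sketch of the Error guard: Error must escape to Netty's
// TailContext for standard handling, while plain Exceptions are consumed
// and logged by the handler. "consumes" is a made-up name.
public final class ErrorGuardSketch {
    public static boolean consumes(Throwable cause) {
        if (cause instanceof Error) {
            // Propagate (in Netty: ctx.fireExceptionCaught(cause)) so that
            // OutOfMemoryError / StackOverflowError are never suppressed.
            return false;
        }
        // Consume; log at DEBUG or WARN based on connection state.
        return true;
    }
}
```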
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
In reactor-netty's H2 path, `doOnConnected` fires once per TCP connection and `connection.channel()` IS the parent channel (`channel.parent()` is null). The previous code assumed `doOnConnected` fires for child/stream channels where `channel.parent()` would return the TCP parent.

Fix: resolve the H2 parent as `channel.parent() ?? channel`, handling both the observed case (parent == null, channel IS the parent) and the alternate case (parent != null, install on parent).

Verified with integration test:
- Linux/epoll with TCP RST proxy (SO_LINGER=0, 30s idle timeout)
- 4.79.1 baseline: TailContext WARN appeared (`Connection reset by peer`)
- Fixed build: WARN suppressed, handler logged at DEBUG (activeStreams=0)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
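The `channel.parent() ?? channel` resolution from this commit (later simplified once `doOnConnected` was confirmed to always hand over the parent) can be sketched without Netty types. `Chan` and `resolveParent` are illustrative stand-ins, not Netty's API:

```java
// Toy model of the parent-channel resolution: a null-coalescing fallback
// over a minimal stand-in for io.netty.channel.Channel's parent() accessor.
public final class ParentResolutionSketch {
    interface Chan { Chan parent(); }

    // H2 path: doOnConnected may hand us the parent TCP channel directly
    // (parent() == null) or a child whose parent() is the TCP channel.
    static Chan resolveParent(Chan channel) {
        return channel.parent() != null ? channel.parent() : channel;
    }

    public static void main(String[] args) {
        Chan parent = () -> null;   // observed case: channel IS the parent
        Chan child = () -> parent;  // alternate case: install on the parent
        System.out.println(resolveParent(parent) == parent);
        System.out.println(resolveParent(child) == parent);
    }
}
```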
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
doOnConnected fires for the parent TCP channel in reactor-netty's H2 path, so connection.channel() IS the parent. No need for channel.parent() resolution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
- Add local/remote address to WARN and DEBUG log messages for diagnostic parity with RNTBD connection loggers
- Mark handler `@ChannelHandler.Sharable` with a singleton `INSTANCE` (handler is stateless - no instance fields)
- Update `ReactorNettyClient` to use `INSTANCE` instead of `new`
- Update tests to use `INSTANCE`

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Matches `PartitionProcessor`/`HealthChecker` patterns - avoids SLF4J inline formatting issues. `Channel.toString()` provides L:/R: addresses. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
Resolve vmId lazily via `ClientTelemetry.getMachineId(null)` on first access from a non-event-loop thread. Store as an immutable field in the `@Sharable` handler singleton. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove the lazy singleton pattern (`getOrCreateInstance`) that could call `ClientTelemetry.getMachineId()` on the Netty event loop (5s blocking). Instead, create the handler eagerly in `configureChannelPipelineHandlers()`, which runs on the caller's setup thread. The `@Sharable` handler instance is captured by the `doOnConnected` lambda. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove all blocking calls. Add `ClientTelemetry.getCachedMachineId()`, which reads a volatile field populated by `getMachineId()` during client init. The handler reads it at log time - a pure volatile read, zero blocking. Restores the static `INSTANCE` singleton (handler is stateless again). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
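The non-blocking cached-read pattern from this commit can be sketched in a few lines. Class and method names below are illustrative; only the pattern (write once during init, volatile read at log time) comes from the commit description:

```java
// Sketch of the cached machine-id pattern: the expensive lookup runs once
// during client initialization, and log-time access is a pure volatile read
// that is safe to perform on a Netty event-loop thread.
public final class CachedMachineIdSketch {
    // null until initialization completes; volatile guarantees visibility.
    private static volatile String cachedMachineId;

    // Called from client init, where blocking (up to ~5s) is acceptable.
    static void initMachineId(String resolved) {
        cachedMachineId = resolved;
    }

    // Pure volatile read - zero blocking, callable from the event loop.
    static String getCachedMachineId() {
        return cachedMachineId; // may be null if init has not finished yet
    }
}
```

Callers must tolerate a `null` result while initialization is still in flight, e.g. by omitting the machine id from the log line.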
LGTM, Thanks

/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
Problem
Customers see noisy Netty WARN logs in HTTP/2 scenarios: `exceptionCaught()` events (e.g. `Connection reset by peer`) that reach the tail of the pipeline are logged at WARN by Netty's TailContext.
Root Cause
In HTTP/2, reactor-netty multiplexes streams on a shared parent TCP connection. The parent and child channels have different pipeline structures:
HTTP/1.1 pipeline (single channel — no leak to TailContext): exceptions are handled by reactor-netty's `ChannelOperationsHandler`.

HTTP/2 parent channel pipeline (BEFORE fix — leak to TailContext): `Http2FrameCodec` → `Http2MultiplexHandler` → no terminal exception handler, so TailContext logs WARN.

HTTP/2 parent channel pipeline (AFTER fix): `Http2FrameCodec` → `Http2MultiplexHandler` → `Http2ParentChannelExceptionHandler` consumes the exception.

HTTP/2 child stream channel pipeline (unchanged).
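The before/after difference can be modeled without Netty as a toy handler chain — purely illustrative Java, not the SDK's or Netty's code: an exception travels inbound through the handlers, and anything nothing consumes ends up at an implicit tail that logs WARN.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a Netty-style pipeline: if no handler consumes an exception,
// the implicit tail (TailContext) logs it at WARN.
public final class PipelineSketch {
    interface Handler { boolean onException(Throwable cause); } // true = consumed

    static String fireException(List<Handler> pipeline, Throwable cause) {
        for (Handler h : pipeline) {
            if (h.onException(cause)) {
                return "DEBUG";     // consumed before reaching the tail
            }
        }
        return "TAIL-WARN";         // reached the tail -> noisy WARN log
    }

    public static void main(String[] args) {
        List<Handler> before = new ArrayList<>();
        before.add(c -> false);     // models Http2FrameCodec: forwards exception
        before.add(c -> false);     // models Http2MultiplexHandler: forwards too
        System.out.println(fireException(before, new RuntimeException("reset")));

        List<Handler> after = new ArrayList<>(before);
        after.add(c -> true);       // terminal handler consumes the exception
        System.out.println(fireException(after, new RuntimeException("reset")));
    }
}
```

The fix corresponds to appending that last consuming handler to the parent channel's pipeline only; the child stream channels are untouched.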
Design: Connection-State-Based Log Level
The handler consumes `Exception` types on the parent channel (no exception type filtering within `Exception`). `Error` types (e.g., `OutOfMemoryError`) are not consumed — they propagate to TailContext for standard Netty handling.

The log level for `Exception` types is determined by connection state:
- DEBUG when `activeStreams == 0` OR `!channelActive`
- WARN otherwise, including when the stream count is unknown (`null`)

The active stream count is retrieved via `Http2FrameCodec.connection().numActiveStreams()` on the same parent channel pipeline. The lookup returns `null` (not `-1`) if the codec is unavailable, making the unknown-state case explicit. Failures to retrieve the stream count are logged at DEBUG.

Why no exception type filtering?
By the time any exception reaches our handler, all upstream handlers (`Http2FrameCodec`, `Http2MultiplexHandler`) have already handled the protocol actions (GOAWAY, stream reset, child channel error delivery). The exception reaching TailContext is an echo of already-handled work, regardless of type. Connection state (active streams + channel activity) is the only dimension that determines whether the exception has diagnostic value.

Why OR (not AND) for the DEBUG condition?
Either condition alone is sufficient:
- `activeStreams == 0` — no in-flight requests affected, regardless of channel state
- `!channelActive` — channel is already dead; any active streams will fail through their Reactor subscribers independently

Why no `ctx.close()`?

The handler does NOT close the channel. Connection lifecycle is owned by:
- the connection pool (reactor-netty) — evicts connections based on `!channel.isActive()`, GOAWAY, `maxIdleTime`, `maxLifeTime`

Our handler is the last in the pipeline with the least protocol context. Closing here would race with reactor-netty's pool management and could prematurely kill connections after non-fatal errors.
Why propagate `Error` but not `Exception`?

`Error` types (OOM, `StackOverflowError`) represent JVM-level failures that should not be silently consumed. Re-throwing `Exception` to TailContext provides no functional value — TailContext just logs WARN and swallows, which is what our handler already does with better context (connection state in the log message).

Testing
9 EmbeddedChannel unit tests with a production-matching pipeline (`Http2FrameCodec` → `Http2MultiplexHandler` → handler):
- `withoutHandler_exceptionReachesTail`
- `withHandler_zeroActiveStreams_consumedAtDebug`
- `withHandler_exceptionDoesNotCloseChannel`
- `withHandler_runtimeException_zeroActiveStreams_consumed`
- `withHandler_npe_zeroActiveStreams_consumed`
- `withHandler_activeStreams_consumedAtWarn`
- `withHandler_activeStreams_channelNotClosed`
- `withHandler_codecAbsent_fallsBackToWarnPath`
- `withHandler_errorNotConsumed_propagatesToTail`

WARN path log (`activeStreams=1`, `channelActive=true`)

Note: The `!channelActive` branch cannot be unit-tested with EmbeddedChannel because `disconnect()` tears down the pipeline before `fireExceptionCaught` can reach handlers. In production, `exceptionCaught()` fires while the channel is transitioning to inactive.

Impact
- Handler only implements `exceptionCaught()` — Netty's `@Skip` optimization bypasses it for all hot-path events
- `Error` types are propagated, not consumed